--- GOCR v0.2.x ---
What is it?
- OCR = optical character recognition
- reads pnm, pbm, pgm, ppm, some pcx and tga image files
- outputs the recognized text
How to compile?
gzip -cd gocr_0_2_xx.tgz | tar xfv - # extract the files
# edit the Makefile if necessary
# change option -O0 to -O2 (Makefile) for optimization
cd gocr_0_2 # change directory
make # you need gcc/g++; other C++ compilers should work too
How to start?
gocr -h # help
gocr -i file.pbm # minimum option
gocr -v 1 -v 32 -m 4 -i file.pbm # layout analysis and out30.bmp output
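If you want to call gocr from a script, the command lines above can be
assembled programmatically. The following Python sketch uses only the -v, -m
and -i options shown above. The helper names build_gocr_command and run_gocr
are made up for this example, and run_gocr assumes that gocr is on the PATH
and writes the recognized text to standard output.

```python
import subprocess


def build_gocr_command(image, verbosity=(), mode=None):
    """Build a gocr argument list from the options shown in this README."""
    cmd = ["gocr"]
    for v in verbosity:        # -v may be repeated, e.g. -v 1 -v 32
        cmd += ["-v", str(v)]
    if mode is not None:       # -m 4 requests layout analysis
        cmd += ["-m", str(mode)]
    cmd += ["-i", image]       # -i names the input image
    return cmd


def run_gocr(image, **options):
    """Run gocr and return what it prints (assumption: text goes to stdout)."""
    result = subprocess.run(build_gocr_command(image, **options),
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    print(build_gocr_command("file.pbm"))
    print(build_gocr_command("file.pbm", verbosity=(1, 32), mode=4))
```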
How to get image files?
- scan text pages and save them as PGM/PBM files
- you can use the PNM tools to convert several image formats into pbm/pgm
- djpeg can be used to convert jpeg into pgm
djpeg -grayscale -outfile file.pgm infile.jpg
- generate your own using TeX+DVIPS+GS or other programs
- generate my examples: make font1.pbm font2.pbm
WARNING!!!
If you scan an A4 page at 300dpi, the image is about 2500x3500 pixels
and gocr needs 8.75MB (one byte per pixel) to hold the picture in memory.
The program may need a second copy, so gocr can take about 17MB of memory.
This is independent of whether you use b/w or gray-scale images.
Be sure that you have enough RAM installed in your machine!
Alternatively, you can cut the picture into smaller pieces.
That can be done by:
jconv -shrink -pbm bigfile.pbm part1.pbm 0 0 0 1000;gocr -i part1.pbm
jconv -shrink -pbm bigfile.pbm part2.pbm 0 0 1000 1000;gocr -i part2.pbm
jconv -shrink -pbm bigfile.pbm part3.pbm 0 0 2000 1000;gocr -i part3.pbm
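The memory figures in the warning above are simple arithmetic. The Python
sketch below reproduces them, assuming one byte per pixel (this is implied by
the 8.75MB figure, not checked against the gocr sources). The function names
are made up for this example.

```python
def gocr_memory_bytes(width, height, copies=2, bytes_per_pixel=1):
    """Rough estimate of the memory needed to hold the image in RAM."""
    return width * height * bytes_per_pixel * copies


def strip_height_for_budget(width, budget_bytes, copies=2, bytes_per_pixel=1):
    """Largest number of pixel rows whose estimate stays within the budget."""
    return budget_bytes // (width * bytes_per_pixel * copies)


if __name__ == "__main__":
    one_copy = gocr_memory_bytes(2500, 3500, copies=1)
    two_copies = gocr_memory_bytes(2500, 3500, copies=2)
    print(one_copy / 1e6, "MB for one copy")
    print(two_copies / 1e6, "MB for two copies")
    print(strip_height_for_budget(2500, 8_000_000), "rows fit into 8 MB")
```

The second function gives a feeling for how tall each piece may be when you
cut a big scan into strips for a machine with little RAM.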
Dependencies:
- gcc, binutils (or another C compiler)
- LaTeX, dvips, ghostscript to create the pbm examples
Features:
- fonts 20-60 pixels ( 5pt * 1in/72pt * 300 dpi = about 20 dots )
- output of an image file for checking the detection
- very slow (speed will be improved once recognition works well)
  12pt 300dpi 1700x950 16lines 700chars 22x28 P90=40s..90s v0.2.3 (gcc -O0)
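The font size formula in the list above converts points into pixels at a
given scan resolution (1 inch = 72 points). A small Python sketch of that
conversion follows. The function names are made up for this example, and the
20-60 pixel range is simply the one quoted above.

```python
def points_to_pixels(points, dpi):
    """Font size in pixels: points * (1 inch / 72 points) * dots per inch."""
    return points * dpi / 72


def pixels_to_points(pixels, dpi):
    """Inverse conversion, from pixels back to points."""
    return pixels * 72 / dpi


def in_supported_range(points, dpi, low=20, high=60):
    """True if the font falls into the 20-60 pixel range quoted above."""
    return low <= points_to_pixels(points, dpi) <= high


if __name__ == "__main__":
    for pt in (5, 12):
        print(pt, "pt at 300 dpi =", round(points_to_pixels(pt, 300), 1), "pixels")
    print("60 pixels at 300 dpi =", round(pixels_to_points(60, 300), 1), "pt")
```

So at 300 dpi the quoted range corresponds to fonts from about 5pt up to
about 14pt. Scanning at a higher resolution moves larger pixel sizes out of
the range.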
What does >> NOT << work at the moment:
- complex layouts (try option -m 4)
- bad scans, noisy/snowy images, FAX-quality images
- serif fonts, italic fonts, slanted fonts
- handwritten text (this will hold for the next ten years)
- rotated images (slightly rotated images should be no problem)
- small fonts (fax like) or a mix of different font sizes
- colored images (use black on white!)
- Chinese, Arabic, Egyptian, Cyrillic or Klingon fonts
- using a database (create_db is for developer tests)
How does it work (or how should it work)?
- put the entire file into RAM (300dpi grayscale recommended)
- remove dust and snow
- detect small rotation angles (lines which are not horizontal)
- detect text boxes (option -m 4)
- detect text-lines
- detect characters
- first step recognition (every character has its own empirical procedure)
- no neural network or similar general algorithms
- analyze unrecognized chars by comparing them with recognized ones
- try to divide overlapping letters
- experimental: compare all letters (like compression of pictures)
- for more details see the ocr.tex documentation
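To give a feeling for one of the steps listed above, here is a Python sketch
of text-line detection on a black/white bitmap. It is NOT the procedure gocr
uses (see ocr.tex for that). It is a simple row projection, which assumes the
page is not rotated and that text lines are separated by completely white
rows. The function name and the list-of-rows bitmap format are made up for
this example.

```python
def find_text_lines(bitmap):
    """Return (first_row, last_row) for each run of rows containing black pixels.

    bitmap is a list of rows; each row is a list of 0 (white) or 1 (black).
    """
    lines = []
    start = None
    for y, row in enumerate(bitmap):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new text line begins here
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # a white row closes the line
            start = None
    if start is not None:                  # text reaches the bottom edge
        lines.append((start, len(bitmap) - 1))
    return lines


if __name__ == "__main__":
    page = [
        [0, 0, 0],
        [1, 0, 1],
        [0, 1, 0],
        [0, 0, 0],
        [1, 1, 1],
    ]
    print(find_text_lines(page))
```

The same idea applied to columns inside one text line gives a crude character
separation. It fails for touching or overlapping letters, which is why the
list above has a separate step for them.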
How can I optimize the result?
- make good scans
- try to change the critical gray level (option -l <n>)
- check the result in out30.bmp (option -v 32)
- increase option -d <n> for noisy high-resolution images
- try different combinations for option -m <n>
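The critical gray level mentioned above decides which pixels of a gray-scale
image count as black. The Python sketch below shows the idea. It assumes gray
values from 0 (black) to 255 (white) and treats every pixel darker than the
level as black. Whether gocr compares in exactly this way is not stated here,
and the function names are made up for this example.

```python
def threshold(gray, level):
    """Turn gray values (0=black .. 255=white) into 1 (black) / 0 (white)."""
    return [[1 if pixel < level else 0 for pixel in row] for row in gray]


def black_fraction(bitmap):
    """Fraction of black pixels, a quick check of how a level behaves."""
    total = sum(len(row) for row in bitmap)
    black = sum(sum(row) for row in bitmap)
    return black / total if total else 0.0


if __name__ == "__main__":
    gray = [
        [10, 200],
        [120, 130],
    ]
    for level in (100, 128, 160):
        bw = threshold(gray, level)
        print(level, bw, black_fraction(bw))
```

A higher level turns more pixels black, which thickens thin strokes but also
brings in more noise. A lower level does the opposite.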
ToDo (no particular order):
- frames should be recognized
- rearrange dust and line detection using box-list (will give a speed-up)
- better character recognition of course (enter the top ten of OCR-PGs)
- introduce probability and alternative chars
- better distance function (comparison of characters)
- detection of orientation (e.g. 90, 180, 270 deg rotation)
- learn mode (kind of database)
- documentation, ocr.tex (How does this program work.)
- x11-frontend (GTK+,TCL ???)
- making a good interface for other applications
- using a dictionary (ispell)
- switch to C or C++, which is better ???
- picture extraction
- making the code more readable
- HTML (or other formatted) output
- math formula detection, font type detection
- set up a CVS server
- feature extraction and classification (other engine);
  I think that is the most difficult and most important task
- improve performance (also parallel processing etc.)
- handwritten texts
--- uff, really a lot of work ---
- Feel free to add your suggestions and wishes,
  or tell me what the most important point is for you.
How can you help me?
- Send comments, ideas and sources or SMALL example files as .pbm.gz or jpeg.
- If you have a lot of money, spend a bit on a small notebook,
  so I can work on the program wherever I am.
  I am also interested in buying a cheap mininotebook (6-10"TFT,ext. CD+FD via USB).
- At the moment I really need example files (.pbm.gz or jpeg <100kB) for testing
  the behavior of the ocr engine under different conditions,
  because scanning takes a lot of time which I do not have.
  But do not send files which commercial OCR programs cannot convert,
  or which are protected by copyright against copying and electronic processing.
  That will help to make this the world's best open source OCR program. :) Thanks!
- Send me your results (errors,num_chars,dpi) and, if possible, the results
  and names of professional OCR programs for statistics.
History: (Changes)
- v0.1 project started (not documented), summer 1999
- v0.2 line scanning added
  v0.2.1 first official release on freshmeat.net, March 2000
  v0.2.2 gocr_0_2.tgz expands into gocr_0_2 directory (thanks to zz99zz)
         engine upgraded a bit, some bugs fixed (umlaut, thin lines)
         short documentation added (ocr.tex)
         colored output (out30.bmp) for test/development mode
         - reading of ASC-PBM and PCX (1 bit) was buggy
  v0.2.3 some layout analysis (very slow, try -m 4)
         engine modified, ... still a lot to do
  v0.2.3b better (?) distance function, engine updated
         - database added for testing
  1000 downloads counted !!! May 2000
  v0.2.4 three char division (connected chars), dust removal
  v0.2.4a2 some details added (better dust removal and char division)
  v0.2.4a3 convert renamed to jconv
         a lot of people are happy about the program, a good motivation for me ;)
  v0.2.4a4 you can choose stdin as input now,
         that gives you the full power of conversion tools
         example: djpeg -pnm -gray text.jpg | gocr -i -
Bugs:
Please do not hesitate to report any errors!
And, if possible, their fixes!
Good ideas are always welcome!
- if you send me an example file, please only use XXXXX.pbm.gz
- v0.2.1
  - some people have problems running gocr on DOS/Win95
    I guess: stack overflow. Is someone able to analyze or fix this?
  - large black areas in pbm files cause a segfault on
    Ultra/Sparc (64bit) machines running Linux (2.1.126).
    There is a recursive function in the program which causes a
    stack overflow, which is not detected by the Linux kernel (BUG?).
    I am looking for a better solution.
- v0.2.3 still problems with segmentation
  - gcc 2.95.2 (SuSE6.4) error in load_db(), => fixed (thx to jasper)
- v0.2.4 I guess, there are still bugs.
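The stack overflow listed under v0.2.1 above comes from a recursive function
that runs into trouble on large black areas. One common way around deep
recursion is to keep the pending pixels on an explicit stack. The Python
sketch below shows this for a 4-connected region fill. It is only an
illustration of the general technique, not the code used in gocr and not
necessarily the fix the author has in mind. The function name and the bitmap
format are made up for this example.

```python
def fill_region(bitmap, x0, y0):
    """Clear the 4-connected black region containing (x0, y0); return its size.

    bitmap is a list of rows with 0 (white) / 1 (black) and is changed in place.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    if not (0 <= x0 < width and 0 <= y0 < height) or bitmap[y0][x0] != 1:
        return 0
    stack = [(x0, y0)]          # explicit stack instead of recursive calls
    bitmap[y0][x0] = 0          # mark on push so no pixel is pushed twice
    count = 0
    while stack:
        x, y = stack.pop()
        count += 1
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height and bitmap[ny][nx] == 1:
                bitmap[ny][nx] = 0
                stack.append((nx, ny))
    return count


if __name__ == "__main__":
    big = [[1] * 300 for _ in range(300)]   # one large black area
    print(fill_region(big, 0, 0), "pixels filled without recursion")
```

The stack lives on the heap, so its size is limited by available memory and
not by the call stack of the process.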
Latest news:
http://altmark.nat.uni-magdeburg.de/~jschulen/ocr/index.html
Authors:
Joerg.Schulenburg@physik.uni-magdeburg.de
Thanks:
...to everyone who contributed to gocr. If you feel that your
name should be in this list, write mail to the author. These
are in no particular order:
G.Kugler for sending me example files and testing. (MaiMM)
...
... and everyone else who submitted bug-reports,
feature-requests and patches.